Finding interesting, available datasets is a concern for everyone involved in data science.
Where to look?
Google
GitHub
Kaggle
Open government data portals
?
The search for analysis-ready datasets
If cleaning data is 70% of the data science process, then analysis-ready data (even in basic CSV format) is priority #1 after addressing the business questions.
Analysis-ready data also needs to include some form of documentation.
The US open government data portal
Examples of PDF reports
Examples of the bad and the good
The reality and struggle
… are real
ttbbeer
An R data package of beer statistics from the U.S. Department of the Treasury, Alcohol and Tobacco Tax and Trade Bureau (TTB)
The plan: liberate more beer statistics datasets from open U.S. government data portals into analysis-ready data frames for R and beyond
The method: web scraping with rvest (the first dataset was copied and pasted into Excel because of the limited R ecosystem of PDF-parsing packages)
The dream: increase awareness of beer analytics and promote analysis-ready datasets from data.gov
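The scraping step in the method above might look roughly like the sketch below. It is only an illustration, not the code behind ttbbeer: the inline HTML, the "stats" table id, the column names, and the numbers are made-up placeholders standing in for a real statistics page. Against a live page, read_html() with the page URL would take the place of minimal_html().

```r
library(rvest)

# Hypothetical stand-in for a government statistics page.
# The table id, column names, and values are placeholders, not TTB data.
page <- minimal_html('
  <table id="stats">
    <tr><th>year</th><th>barrels</th></tr>
    <tr><td>2008</td><td>100</td></tr>
    <tr><td>2009</td><td>120</td></tr>
  </table>
')

# Select the table by its CSS id and parse it into a data frame.
beer <- page %>%
  html_element("#stats") %>%
  html_table()
beer <- as.data.frame(beer)

# Save an analysis-ready CSV (a temporary file, for illustration only).
out <- tempfile(fileext = ".csv")
write.csv(beer, out, row.names = FALSE)
```

From there, the data frame can be documented and bundled into a data package, or the CSV shared as-is for use outside R.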